e61eaa38aed621dd776d0e67cfeee366-AuthorFeedback.pdf

Neural Information Processing Systems

This relationship is obvious if the transition and reward factorizations are the same, namely X[I_i] = X[J_i] for all i ∈ [m], in which case the FMDP has m independent components. The remarkable aspect here is that such a relationship holds even if the transition and reward factorizations differ arbitrarily. To summarize the insight: in the long run, different growth rates of the counters reflect the different importance of the components towards maximizing cumulative rewards, while early on, their growth can suffer large variance. Intuition: please see our Response 2.1 for an intuitive explanation of why we need the cross-component bonuses. Moreover, these cross-component bonuses offer new insight (see our Response 2.1).



Understanding the Differences in Foundation Models: Attention, State Space Models, and Recurrent Neural Networks

Neural Information Processing Systems

Softmax attention is the principal backbone of foundation models for various artificial intelligence applications, yet its quadratic complexity in sequence length can limit its inference throughput in long-context settings. To address this challenge, alternative architectures such as linear attention, State Space Models (SSMs), and Recurrent Neural Networks (RNNs) have been proposed as more efficient alternatives. While connections between these approaches exist, such models are commonly developed in isolation, and there is a lack of theoretical understanding of the shared principles underpinning these architectures and of their subtle differences, which greatly influence performance and scalability. In this paper, we introduce the Dynamical Systems Framework (DSF), which allows a principled investigation of all these architectures in a common representation. For instance, we compare linear attention and selective SSMs, detailing their differences and the conditions under which both are equivalent.
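The complexity gap the abstract refers to can be illustrated with a toy comparison (an illustrative sketch only, not the paper's DSF formulation): softmax attention materializes an n × n score matrix, while a linear-attention recurrence carries only a fixed-size state per token. The ReLU feature map below is one common but arbitrary choice.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Full n x n score matrix: O(n^2 d) time, O(n^2) memory.
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # Recurrent form: a (d x d) state updated per token, O(n d^2) time,
    # O(d^2) memory -- independent of sequence length n.
    n, d = Q.shape
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # positive feature map (assumed)
    S = np.zeros((d, d))                       # running sum of phi(k_t) v_t^T
    z = np.zeros(d)                            # running normalizer sum of phi(k_t)
    out = np.empty_like(V)
    for t in range(n):
        k, v, q = phi(K[t]), V[t], phi(Q[t])
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z)
    return out

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.standard_normal((3, n, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

The recurrent loop is exactly what makes linear attention comparable to SSMs and RNNs: the per-token state update is a (degenerate) linear dynamical system, which is the common representation the DSF exploits.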


New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity

Neural Information Processing Systems

As an incremental-gradient algorithm, hybrid stochastic gradient descent (HSGD) enjoys the merits of both stochastic and full gradient methods for finite-sum minimization problems. However, the existing rate-of-convergence analysis for HSGD assumes with-replacement sampling (WRS) and is restricted to convex problems. It is not clear whether HSGD still carries these advantages under the common practice of without-replacement sampling (WoRS) for non-convex problems. In this paper, we affirmatively answer this open question by showing that, under WoRS and for both convex and non-convex problems, it is still possible for HSGD (with constant step size) to match full gradient descent in rate of convergence, while maintaining an incremental first-order oracle complexity that is comparable to stochastic gradient descent and independent of sample size. For a special class of finite-sum problems with linear prediction models, our convergence results can be further improved in some cases. Extensive numerical results confirm our theoretical findings and demonstrate the favorable efficiency of WoRS-based HSGD.
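The WRS/WoRS distinction the abstract contrasts can be sketched on a toy least-squares problem. The hybrid rule below (one full-gradient correction per epoch followed by an incremental pass with constant step size) is an illustrative stand-in, not the paper's exact HSGD schedule; the objective, step size, and epoch count are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # reference minimizer

def grad_i(x, i):
    # Gradient of the i-th summand (1/2)(a_i @ x - b_i)^2.
    return A[i] * (A[i] @ x - b[i])

def hsgd(x, epochs=200, step=0.01, replacement=False):
    # Illustrative hybrid scheme: one full-gradient step per epoch,
    # then an incremental pass over the data. WoRS shuffles indices
    # (each sample used exactly once per epoch); WRS draws i.i.d.
    for _ in range(epochs):
        x = x - step * (A.T @ (A @ x - b)) / n   # full-gradient correction
        idx = (rng.integers(0, n, n) if replacement
               else rng.permutation(n))
        for i in idx:
            x = x - step * grad_i(x, i)          # constant step, as in the abstract
    return x

x0 = np.zeros(d)
err_wors = np.linalg.norm(hsgd(x0) - x_star)
err_wrs = np.linalg.norm(hsgd(x0, replacement=True) - x_star)
print(err_wors, err_wrs)
```

Both variants approach the least-squares solution up to a constant-step-size neighborhood; the paper's contribution is the convergence-rate analysis for the WoRS case, which this sketch only illustrates empirically.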



New Insights into Automatic Treatment Planning for Cancer Radiotherapy Using Explainable Artificial Intelligence

Abrar, Md Mainul, Jia, Xun, Chi, Yujie

arXiv.org Artificial Intelligence

Objective: This study aims to uncover the opaque decision-making process of an artificial intelligence (AI) agent for automatic treatment planning. Approach: We examined a previously developed AI agent based on the Actor-Critic with Experience Replay (ACER) network, which automatically tunes treatment planning parameters (TPPs) for inverse planning in prostate cancer intensity modulated radiotherapy. We selected multiple checkpoint ACER agents from different stages of training and applied an explainable AI (EXAI) method to analyze the attribution from dose-volume histogram (DVH) inputs to TPP-tuning decisions. We then assessed each agent's planning efficacy and efficiency and evaluated their policy and final TPP tuning spaces. Combining these analyses, we systematically examined how ACER agents generated high-quality treatment plans in response to different DVH inputs. Results: Attribution analysis revealed that ACER agents progressively learned to identify dose-violation regions from DVH inputs and promote appropriate TPP-tuning actions to mitigate them. Organ-wise similarities between DVH attributions and dose-violation reductions ranged from 0.25 to 0.5 across tested agents. Agents with stronger attribution-violation similarity required fewer tuning steps (~12-13 vs. 22), exhibited a more concentrated TPP-tuning space with lower entropy (~0.3 vs. 0.6), converged on adjusting only a few TPPs, and showed smaller discrepancies between practical and theoretical tuning steps. Taken together, these findings indicate that high-performing ACER agents can effectively identify dose violations from DVH inputs and employ a global tuning strategy to achieve high-quality treatment planning, much like skilled human planners. Significance: Better interpretability of the agent's decision-making process may enhance clinician trust and inspire new strategies for automatic treatment planning.
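The entropy metric used to compare TPP-tuning spaces can be computed directly from tuning-action frequencies. The counts below are hypothetical and will not reproduce the paper's ~0.3 vs. ~0.6 values; the sketch only shows why a concentrated tuning space scores lower.

```python
import numpy as np

def tuning_entropy(counts):
    # Shannon entropy (in nats) of the empirical TPP-tuning distribution;
    # lower entropy means tuning is concentrated on fewer parameters.
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log(p)).sum())

# Hypothetical tuning-action counts over five TPPs:
concentrated = [90, 6, 2, 1, 1]     # a well-trained agent adjusting few TPPs
diffuse = [40, 25, 15, 12, 8]       # an earlier checkpoint with spread-out tuning

print(round(tuning_entropy(concentrated), 2))
print(round(tuning_entropy(diffuse), 2))
```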


Babylonian text missing for 1,000 years deciphered with AI

Popular Science

A team of ancient literature experts has deciphered a Mesopotamian text that was missing for over 1,000 years. Etched on clay tablets, the Hymn to Babylon describes the ancient megacity in "all of its majesty" and gives new insights into the everyday lives of those who resided there. The text is detailed in a study published in the journal Iraq. Founded in Mesopotamia around 2,000 BCE, Babylon was once the largest city in the world.


Reviews: Multi-Criteria Dimensionality Reduction with Applications to Fairness

Neural Information Processing Systems

Originality: The authors solve a problem left open in a previous paper and strictly improve on previous work for approximation algorithms. They do so by giving new insights into structural properties of extreme points of semi-definite programs and more general convex programs. As far as I understand, the algorithms presented in the paper are not substantially new, but the analysis of these algorithms is novel. Quality: The paper seems complete and the proofs appear correct. The paper tackles interesting problems and reaches satisfying conclusions. They essentially close the book on the k = 2 case, make significant improvements for k > 2, and leave open some questions for structured data.


Reviews: Theoretical Analysis of Adversarial Learning: A Minimax Approach

Neural Information Processing Systems

Originality: I find the approach original and interesting. Other works have been cited, and the related-work section is written clearly and in detail; it gives a nice overview. I think only that it is important to highlight more clearly the differences between [40] and the current work. In particular, it is unclear what the penalty parameter is and how their method of adversarial training relates to this work: do they optimize a different bound, what quantities do they optimize, and do these quantities show up in the proposed bound? Quality: The work seems complete and sound as far as I could check. I could not check all the proofs in detail, but I read the work in great detail.